Random Forest

Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma

Random Forest Introduction

  • Ensemble machine learning method based on a large number of decision trees voting to predict a classification
  • Benefits compared to decision tree:
    • Able to function with incomplete data
    • Lower likelihood of an overfit
    • Improved prediction accuracy

Random Forest Applications

  • Banking
    • Fraud Detection
    • Loan Default Risk
  • Business
    • Predictive Advertising

Methods

  • Bagging
  • Boosting
  • Random Feature Selection
  • Cross Validation
  • Ensemble voting and Prediction

Bootstrap Sampling (Bagging)

  • Each decision tree uses a random sample of the original dataset
    • Using a subset of the dataset reduces the probability of an overfit model
    • Rows with missing data will often be left out of the sample, improving performance
    • Performed with replacement

Boosting

  • When individual models are trained in a sequential way, each model then learns the mistakes made by preceding model.

Random Feature Selection

  • A random set of features is selected for each node in training
    • Information about feature importance may be saved and applies in future iterations
    • Even with automated random feature selection, feature selection and engineering prior to training may improve performance

Code
ctrl <- trainControl(method = "cv",  number = 10) 

bagged_cv <- train(
  Group~ LF + GF + Biomes + Range + Habitat_degradation +  
     Habitat_loss + IAS + Other + Unknown + Other + Over_exploitation,
  data    = species_train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE)

plot(varImp(bagged_cv), 10)

Cross Validation

  • Validation of performance of model
    • Resampling method similar to bootstrapping, but without replacement
    • Allows approximation of the general performance of a model

Code
 m3 <- rpart(
   formula = Group~ LF + GF + Biomes + Range +
     Habitat_degradation + Habitat_loss + IAS +
     Other + Unknown + Other + Over_exploitation,
   data    = species_train,
   method  = "anova"
 )
 rpart.plot(m3)

Ensemble Voting and Prediction

  • Each trained decision tree produces its own prediction
    • Decision trees are independent, and were trained on different subsets of both data and features
  • The results from each decision tree are combined into a voting classifier
    • The mode of the classification results will be the final prediction

Data Preparation

  • Preprocessing
    • Encode categorical features into numerical / factor features
    • Split the training set into a training and test set, minimizing class imbalance

Preprocessing

  • Class Imbalance
    • Resample smaller classes in order to approximate equal classes
    • Training on imbalanced datasets will bias predictions to the larger class

Prediction

  • Combine results into a vector
    • \(Y=\{y_1,y_2,y_3,y_4,y_5\}\)
  • Identify the most frequently predicted class
    • \(y_{final}=\text{mode}(Y)\)
  • Iterate over entire test set, storing results
  • Generate a confusion matrix, calculate the sensitivity, and precision for each category
  • Iterate after tuning if necessary

##Evaluation - Four metrics are calculated using the test set - \(\text{Accuracy}=\frac{\sum{\left(\text{Actual Label} = \text{Predicted Label}\right)}}{\text{Label Count}}\) - \(\text{Recall}=\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\) - \(\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}\) - \(\text{F1}=\frac{2*(\text{Precision}*\text{Recall})}{\text{Precision}+\text{Recall}}\)

Dataset and Exploration

  • South African Red List
    • Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
  • Purpose
    • Predict whether or not an unknown plant is threatened based on the above characteristics

Distribution of Range by Conservation Status

  • While there are a small number of threatened species with a large range, it is clear that Range is likely a strong predictor of Group status

  • A lower range predicts a higher likelihood of threatened or extinct grouping.

Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  ylim(0,100000)

Feature Associations

  • Cramer’s V Association with Range binned into 20 categories
    • Target feature Group is most associated with Range, Family, Habitat Loss, Biome, and GF
    • The most associated features will likely be the most important features during model training
    • Colinearity does not appear to be present, further checks are

Code
corrDF <- train %>% mutate(Range=ntile(Range, n=20))
corrDF %<>% mutate_at(c("Group","LF","GF","Biomes","Range","Habitat_degradation","Habitat_loss","IAS","Other","Over_exploitation","Pollution","Unknown"),factor)
corrDF <- corrDF %>% mutate(Range=ntile(Range, n=20))
corrplot::corrplot(DescTools::PairApply(corrDF,DescTools::CramerV), type='lower')

Analysis

  • The data was processed to allow it to be modeled effectively using a random forest
  • 5 separate random forest models were created using separate methods of normalization

Data Processing

  • Process the data by setting the first 14 columns as [features] and the last column as the [label]
  • Split the dataset into training and testing sets
  • Combine the training datasets
  • Print the initial number of each category

Class Balancing

  • Process classes A and B
  • Process classes A and C
  • Retain records in: -data_train_AB_resampled where the label is ‘2’
    • data_train_AC_resampled where the label is ‘3’
    • Both data_train_AB_resampled and data_train_AC_resampled where the label is ‘1’
  • combine
data_train_AB <- data_train
data_train_AB <- data_train_AB[data_train_AB$label 
                      != '3',]
data_train_AB_resampled <- ovun.sample(label ~ .,
                      data = data_train_AB, method = "over", 
                      N = 980, seed = 1)$data

data_train_AC <- data_train
data_train_AC <- data_train_AC[data_train_AC$label
                      != '2',]
data_train_AC_resampled <- ovun.sample(label ~ .,
                      data = data_train_AC,
                      method = "over", N = 980,
                      seed = 1)$data

data_train_AB_2 <- data_train_AB_resampled[data_train_AB_resampled$label
                      == '2',]
data_train_AC_3 <- data_train_AC_resampled[data_train_AC_resampled$label
                      == '3',]

data_train_1 <- data_train_AB_resampled[data_train_AB_resampled$label 
                      == '1',]
data_train_combined <- rbind(data_train_1, data_train_AB_2, data_train_AC_3)

cat("Group Counts Pre-Balancing:  ",table(data_train$label),
        "\nGroup Counts Post-Balancing: ",table(data_train_combined$label))
Group Counts Pre-Balancing:   490 148 23 
Group Counts Post-Balancing:  490 490 490

Model Training

\(X_{new}=\frac{X_{old}-\min(X_{old})}{\max(X_{old})-\min(x_{old})}\)

          Score
Accuracy   0.88
Recall     0.79
Precision  0.82
F1         0.81
features_train_1 <- as.data.frame(lapply(features_train, 
                      function(x) {(x-min(x))/(max(x)-min(x))}))
features_test_1 <- as.data.frame(lapply(features_test, 
                      function(x) {(x-min(x))/(max(x)-min(x))}))

data_train_1 <- features_train_1
data_train_1$label <- label
class_counts_1 <- table(data_train_1$label)

model_1 <- randomForest(x = data_train_1[-ncol(data_train_combined)],
                        y = as.factor(data_train_1$label), ntree = 2)
variable_importance_1 = importance(model_1)
pred_comb_1 <- predict(model_1, features_test_1)
accuracy <- sum(label_test == pred_comb_1) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_1)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"], 
                 cm$byClass["Class: 2", "Sensitivity"], 
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"], 
                    cm$byClass["Class: 2", "Pos Pred Value"], 
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall',
                       'Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}-\bar{X}_{old}}{\sigma_{X_{old}}}\)

          Score
Accuracy   0.40
Recall     0.53
Precision  0.41
F1         0.46
features_train_2 <- as.data.frame(lapply(features_train, 
                      function(x) {(x - mean(x))/sd(x)}))
features_test_2 <- as.data.frame(lapply(features_test, 
                      function(x) {(x - mean(x))/sd(x)}))

data_train_2 <- features_train_2
data_train_2$label <- label
class_counts_2 <- table(data_train_2$label)

model_2 <- randomForest(x = data_train_2[-ncol(data_train_combined)],
                      y = as.factor(data_train_2$label), ntree = 2)
variable_importance_2 = importance(model_2)
pred_comb_2 <- predict(model_2, features_test_2)
accuracy <- sum(label_test == pred_comb_2) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_2)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                              cm$byClass["Class: 2", "Sensitivity"],
                              cm$byClass["Class: 3", "Sensitivity"]))

precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                            cm$byClass["Class: 2", "Pos Pred Value"],
                            cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}}{\max(|X_{old}|)}\)

          Score
Accuracy   0.88
Recall     0.71
Precision  0.79
F1         0.75
features_train_3 <- as.data.frame(lapply(features_train, 
                       function(x) {x / max(abs(x))}))
features_test_3 <- as.data.frame(lapply(features_test, 
                        function(x) {x / max(abs(x))}))

data_train_3 <- features_train_3
data_train_3$label <- label
class_counts_3 <- table(data_train_3$label)

model_3 <- randomForest(x = data_train_3[-ncol(data_train_combined)], 
                        y = as.factor(data_train_3$label), 
                        ntree = 2)
variable_importance_3 = importance(model_3)
pred_comb_3 <- predict(model_3, features_test_3)
accuracy <- sum(label_test == pred_comb_3) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_3)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                 cm$byClass["Class: 2", "Pos Pred Value"],
                 cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}-\min(X_{old})}{\max(X_{old})-\min(x_{old})}\)

          Score
Accuracy   0.77
Recall     0.57
Precision  0.54
F1         0.56
features_train_4 <- as.data.frame(lapply(features_train, 
                       function(x) {x / sum(abs(x))}))
features_test_4 <- as.data.frame(lapply(features_test, 
                       function(x) {x / sum(abs(x))}))

data_train_4 <- features_train_4
data_train_4$label <- label
class_counts_4 <- table(data_train_4$label)

model_4 <- randomForest(x = data_train_4[-ncol(data_train_combined)], 
                        y = as.factor(data_train_4$label), ntree = 2)
variable_importance_4 = importance(model_4)
pred_comb_4 <- predict(model_4, features_test_4)
accuracy <- sum(label_test == pred_comb_4) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_4)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                              cm$byClass["Class: 2", "Sensitivity"],
                              cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                            cm$byClass["Class: 2", "Pos Pred Value"],
                            cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                    ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}-\bar{X}_{old}}{\sigma_{X_{old}}}\)

          Score
Accuracy   0.75
Recall     0.43
Precision  0.49
F1         0.46
features_train_5 <- as.data.frame(lapply(features_train, 
                       function(x) {x / sqrt(sum(x^2))}))
features_test_5 <- as.data.frame(lapply(features_test, 
                       function(x) {x / sqrt(sum(x^2))}))

data_train_5 <- features_train_5
data_train_5$label <- label
class_counts_5 <- table(data_train_5$label)

model_5 <- randomForest(x = data_train_5[-ncol(data_train_combined)], 
                        y = as.factor(data_train_5$label), ntree = 2)
variable_importance_5 = importance(model_5)
pred_comb_5 <- predict(model_5, features_test_5)
accuracy <- sum(label_test == pred_comb_5) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_5)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)

recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))

precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                 cm$byClass["Class: 2", "Pos Pred Value"],
                 cm$byClass["Class: 3", "Pos Pred Value"]))

F1 = 2 * recall * precision / ( recall + precision )

printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),ncol=1,
                    byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')

print(printTable)

Prediction

  • Obtain the number of predicted results
  • Initialize an empty vector to store the final prediction results
  • Iterate over each test sample
  • Get the prediction results of the five models for the i-th sample
  • Select the most frequently predicted class as the final prediction result for the i-th sample
  • Now final_pred contains the prediction results after voting
  • Calculate and print the accuracy
  • Convert to a factor type
  • Obtain the confusion matrix
  • Calculate the recall (Sensitivity) for each category
  • Calculate the precision for each category
Code
n <- length(pred_comb_1)

final_pred <- rep(NA, n)

for(i in 1:n) {
   preds <- c(pred_comb_1[i], pred_comb_2[i], pred_comb_3[i], 
              pred_comb_4[i], pred_comb_5[i])

   final_pred[i] <- as.numeric(names(which.max(table(preds))))
}

importances_list <- list(variable_importance_1, variable_importance_2, 
                         variable_importance_3, variable_importance_4, 
                         variable_importance_5)
average_importance <- Reduce("+", importances_list) / length(importances_list)
print(average_importance)

accuracy <- sum(label_test == final_pred) / length(label_test)
print(paste('Accuracy of Voting method:', accuracy))

final_pred_factor <- as.factor(final_pred)
label_test_factor <- as.factor(label_test)

cm_vote <- confusionMatrix(final_pred_factor, label_test_factor)

sensitivity_class1 <- cm_vote$byClass["Class: 1", "Sensitivity"]
sensitivity_class2 <- cm_vote$byClass["Class: 2", "Sensitivity"]
sensitivity_class3 <- cm_vote$byClass["Class: 3", "Sensitivity"]
recall = (sensitivity_class1 + sensitivity_class2 + sensitivity_class3) / 3
print(paste('Recall :', recall))

precision_class1 <- cm_vote$byClass["Class: 1", "Pos Pred Value"]
precision_class2 <- cm_vote$byClass["Class: 2", "Pos Pred Value"]
precision_class3 <- cm_vote$byClass["Class: 3", "Pos Pred Value"]
precision = (precision_class1 + precision_class2 + precision_class3) / 3
print(paste('Precision :', precision))

F1 = 2 * recall * precision / ( recall + precision )
print(paste('F1 :', F1))
Code
ctrl <- trainControl(method = "cv",  number = 10) 

bagged_cv <- train(
  Group~ LF + GF + Biomes + Range +
    Habitat_degradation + Habitat_loss + IAS +
    Other + Unknown + Other + Over_exploitation,
  data    = species_train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE
)
 
plot(varImp(bagged_cv), 10) 

Confusion Matrix

  • Process classes A and B
  • Process classes A and C
  • Retain records in:
    • data_train_AB_resampled where the label is ‘2’
    • data_train_AC_resampled where the label is ‘3’
    • Both data_train_AB_resampled and data_train_AC_resampled where the label is ‘1’
    • combine
Code
cm_vote <- confusionMatrix(final_pred_factor, label_test_factor)
 
library(ggplot2)    
library(grid)
library(gridExtra)          
library(likert)
 
cm_vote <- confusionMatrix(final_pred_factor, label_test_factor)
 
cm <- confusionMatrix(final_pred_factor, label_test_factor)
cm
Confusion Matrix and Statistics

          Reference
Prediction   1   2   3
         1 200  12   1
         2  10  51   0
         3   0   0   9

Overall Statistics
                                          
               Accuracy : 0.9187          
                 95% CI : (0.8805, 0.9478)
    No Information Rate : 0.742           
    P-Value [Acc > NIR] : 3.035e-14       
                                          
                  Kappa : 0.7929          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            0.9524   0.8095  0.90000
Specificity            0.8219   0.9545  1.00000
Pos Pred Value         0.9390   0.8361  1.00000
Neg Pred Value         0.8571   0.9459  0.99635
Prevalence             0.7420   0.2226  0.03534
Detection Rate         0.7067   0.1802  0.03180
Detection Prevalence   0.7527   0.2155  0.03180
Balanced Accuracy      0.8871   0.8820  0.95000
Code
cm_d <- as.data.frame(cm$table)
cm_st <-data.frame(cm$overall)
cm_st$cm.overall <- round(cm_st$cm.overall,2)
cm_d$diag <- cm_d$Prediction == cm_d$Reference
cm_d$ndiag <- cm_d$Prediction != cm_d$Reference     
cm_d[cm_d == 0] <- NA
cm_d$Reference <-  reverse.levels(cm_d$Reference)
cm_d$ref_freq <- cm_d$Freq * ifelse(is.na(cm_d$diag),-1,1) 
 
plt1 <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, 
                                 fill = Freq))+
  scale_x_discrete(position = "top") +
  geom_tile( data = cm_d,aes(fill = ref_freq)) +
  scale_fill_gradient2(guide = FALSE ,low="red",high="mediumvioletred", 
                       mid= "mistyrose",
                    midpoint = 0,na.value = 'white') +
  geom_text(aes(label = Freq), color = 'black', size = 3)+
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "none",
        panel.border = element_blank(),
        plot.background = element_blank(),
        axis.line = element_blank(),
  )
plt2 <-  tableGrob(cm_st)
grid.arrange(plt1, plt2, nrow = 1, ncol = 2,
            top=textGrob("Confusion Matrix",gp=gpar(fontsize=25,font=1)))

Conclusion

Random Forest is a powerful and flexible machine learning algorithm that can be used for a wide range of tasks. It is particularly useful when dealing with complex data composed of a large number of features, and when the goal is to achieve high predictive accuracy while avoiding overfitting. The algorithm incorporates versatility in its capabilities for classification and regression tasks, handling missing data, and displaying robustness when faced with outliers and noisy data. Most extinctions were perennial shrubs found in the Cape Floristic Region. As range was the strongest predictor of extinction, many of the recorded taxa deemed susceptible were range-restricted. Habitat loss is presented as the second strongest variable of importance in predicting plant extinctions. Predictions were based on a quantitative, evidence-based approach, though gaps in knowledge highlighted areas for further study. Improved species monitoring and documentation of threat factors will aid in a deeper understanding of the ecological role and value of South African plant species.